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A Bayesian  Comparison  of  Different 
Classes  of  Dynamic  Models  Using 
Empirical  Data’'^ 

R.  L.  Kashyap'*' 

Abstract 

This  paper  deals  with  the  Bayesian  methods  of  comparing  different  types 
of  dynamical  structures  for  representing  the  given  set  of  observations. 

Specifically,  given  that  a given  process  y(*)  obeys  one  of  r distinct 
stochastic  difference  equations  each  involving  a vector  of  unknown  parameters, 
we  compute  the  posterior  probability  that  a set  of  observations  {y (1 ) , . . • ,y (N) } 
obey  the  i th  equation,  after  making  suitable  assumptions  about  the  prior 
probability  distribution  of  the  parameters  in  each  ciass.  The  difference 
equations  can  be  nonlinear  in  the  variabie  y but  should  be  linear  in  the  parameter 
vector  in  it.  Once  the  posterior  probability  is  known,  we  can  find  a decision 
rule  to  choose  between  the  various  structures  so  as  to  minimize  the  average 
value  of  a loss  function.  The  optimum  decision  rule  is  asymptotically  consistent 
and  gives  a quantitative  explanation  for  the  'principle  of  parsimony'  often  used 
in  the  construction  of  models  from  empirical  data.  The  decision  rule  answers 
a wide  variety  of  questions  such  as  the  advisability  of  a nonlinear  transforma- 
tion of  data,  the  limitations  of  a model  which  yields  a perfect  fit  to  the  data 
(i.e.,  zero  residual  variance)  etc.  The  method  can  be  used  not  only  to  compare 
different  types  of  structures  but  also  to  determine  a reliable  estimate  of  spectral 
density  of  process.  We  compare  the  method  in  detail  with  the  hypothesis  testing 
method,  maximum  entropy  spectral  analysis  method  and  other  methods  and 
give  a number  of  illustrative  examples. 
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I.  INTRODUCTION 

For  determining  an  appropriate  representation  for  a given  time  series,  one 
usually  assumes  a certain  structure  involving  a set  of  parameters  and  the  values 
of  these  parameters  are  estimated  from  the  given  data.  For  many  of  the  geo- 
physical or  hydrological  time  series,  the  physics  of  the  problem  Is  not  well 
understood  to  specify  a unique  structure  for  the  stochastic  process.  In  these 
cases,  one  of  the  main  reasons  for  the  construction  of  models  is  for  understand- 
ing the  dynamics  of  the  process.  Hence  instead  of  restricting  ourselves  to  one 
particular  structure  like  autoregressive  structure,  a linear  combination  of 
orthogonal  polynomials  in  time  t or  a fourier  series,  we  should  endeavor  to 
quantitatively  compare  the  validity  of  these  widely  different  structures  for  repre- 
senting the  given  data.  For  every  structure,  we  should  try  to  determine  the 
probability  that  the  given  set  of  measurements  could  have  come  from  that  struc- 
ture and  choose  that  structure  which  has  the  highest  probability. 

We  will  first  clarify  the  notion  of  structure  and  model  because  these 
words  are  used  in  widely  differing  contexts.  Consider  a stochastic  process 


y(')  e t/ which  obeys  a stochastic  difference  equation  (1.1). 

fj(y(t))  = gj (t,y (t-i) , . . .,y (t-m. ) ,e)  + w(t)» 
where  f,  etc.  obey  the  following  assumptions  (Al)  - (A3). 


(l.i) 
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(Al)  f.(y)  is  any  differentiable  function  of  y such  as  in  y + C,  y + C,  y + C, 
etc. ; f j : y -►  R. 

The  constant  C is  chosen  so  that  the  empirical  mean  of  f.(y)  with  the  given 

N 

set  of  observations  C(N)  “ (y (1 ) , . . . ,y (N) ) is  zero,  i.e.,  C = (1/N)  ^ f,(y(t)). 

T ^i 

(A2)  Let  0=(e,,...,0  ) cR  . g.  is  any  function  linear  in  0,  but  not 

~ I n j I 

necessarily  linear  In  y(t-1),  t etc.  m.  is  any  integer  greater  than  or 
equal  to  zero. 

mj  n . 

g.:  j X.  y xR'-*-R. 
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T = {1 ,2,3, . . . }. 

"i 

when  m.  = 0,  the  function  g.  is  defined  as  a map  Qj:  t x R R. 

The  parameters  m^  and  n.  characterizing  the  function  g^  will  be  of  particular  interest 
iater  in  developing  decision  rules. 

(A3){w('))is  a sequence  of  1.1. D.  variables  with  distribution  N(0,p)  and  w(t)  is 
independent  of  y(t-j),  j ^1,  0 < “. 

A model  Is  a tuple  {fj,  g|,6,p}  uniquely  associated  with  the  difference 
eq.  (1.1).  A process  y(‘)  is  said  to  obey  a model  {fj,  gj,  0,p}  if  it  obeys 
the  associated  difference  equation  (1.1). 

Let  Cj  be  a class  of  models  having  the  same  functions  f and  g but  differing 
in  0 or  p. 


C.  A (f I ,g| ,m. , j} 

* {(fj.  9j.  §.P):  6 0 p < »},  (1.2) 

where 

T "l 

0,  = {0  - (0,,...,0^  ) , 0,  I*  0 V.,  e e R '}  (1.3) 

i 

C.  is  labelled  as  the  ith  class  or  1th  structure.  Two  classes  Cj  and  Cj  are  said 


to  be  mutually  exclusive  if  they  differ  in  the  functions  f or  g,  in  a nontrivial 
way  i.e.,  there  is  no  process  which  obeys  a model  In  C|  as  well  as  a model  in 

C.. 

J 

Given  r mutually  exclusive  classes  or  structures  C^,...,Cj.  and  a set  of 
observations  C = (y (1 ) , . . . ,y (N) } our  intention  is  to  develop  a decision  rule  to 
assign  the  observation  set  to  one  of  the  classes  among  C.,  1 ■■  1 , . . . ,r  so  as  to 
minimize  a suitable  criterion  function  and  also  determine  the  probability  of 


error  associated  with  the  decision.  The  loss  function  can  be  chosen  to  reflect 
the  particular  needs  of  the  problem,  such  as  forecasting  or  estimation  of 
spectral  density. 


k 

We  recall  that  the  only  restriction  on  the  classes  Cj  is  that  the  functions 
g.,  i = 1,2,...  must  be  linear  in  6.  Hence  the  theory  allows  us  to  simultaneously 
compare  widely  different  structures  or  classes  such  as  (i)  the  class  of  autoregressive 
(AR)  models  of  order  m^  In  y,  (11)  the  class  of  AR  models  of  order  m2  in  In  y,  (iii) 
the  class  of  models  made  of  m^  orthogonal  polynomials  in  the  variable  t,  (iv)  class 
of  harmonic  models  involving  certain  frequencies,  using  the  same  data. 

Problems  of  this  type  are  referred  to  as  compound  hypothesis  testing  in  the 
statistical  literature  [2]  and  there  is  no  general  theory  to  handle  them.  There  are 
specific  tests  for  comparing  specific  pairs  of  classes  such  as  classes  of  auto- 
regressive models  of  2 different  orders.  Even  here  the  solution  is  unsatisfactory 
for  a number  of  reasons  such  as  (i)  the  choice  of  arbitrary  quantities  like 
significance  levels  (ii)  lack  of  any  measure  of  the  type  II  error  involved,  (the 
chosen  significance  level  places  an  upperbound  on  the  probability  of  type  | error) 

(iii)  the  intransitivity  of  the  decision  rule  and  its  lack  of  any  optimality 

properties  etc.  There  are  also  many  other  adhoc  rules  such  as  likelihood  ratio 
rule  [2]  or  MAIAC  [3]  which  do  not  usually  possess  asymptotic  consistency  or  give 
a measure  of  the  probability  of  error  of  the  decision. 

Some  common  problems  occurring  in  the  analysis  of  empirical  time  series  are 
really  problems  regarding  the  appropriateness  of  the  different  types  of  structure 
for  explaining  the  data.  It  is  being  increasingly  realized  that  the  traditional 
methods  of  spectral  analysis  involving  the  use  of  window  functions  like  those  of 
Bartlett  or  Hamming  [6]  often  lead  to  misleading  inferences.  To  overcome  such 
objections,  geophysicists  use  a special  method  of  spectral  analysis  known  as  maximum 
entropy  spectral  analysis  (MESA)  whose  Inferences  have  greater  reliability  than 
those  of  the  traditional  methods.  The  method  effectively  assumes  an  autoregressive 
(AR)  structure  for  the  process.  If  we  pose  the  problem  as  one  of  comparing 
different  classes  of  models  having  different  orders  of  AR  models  as  done  here. 


we  will  obtain  a solution  which  overcomes  many  of  the  disadvantages  of  the  MESA 


method,  without  sacrificing  the  advantages.  Moreover,  we  can  choose  the  decision 
rule  to  minimize  a loss  function  which  explicitly  reflects  the  needs  in  the  con- 
text namely  minimizing  the  errors  in  the  estimation  of  spectral  density  of  the 
process.  Similar  problems  are  faced  in  the  spectral  analysis  as  applied  to  the 
processing  of  acoustic  signals  also  [7]. 

In  our  paper  we  will  first  compute  P(C.|c),  the  posterior  probability  of 

the  given  data  having  been  generated  by  some  model  In  C.,  for  every  i = I 

assuming  a suitable  prior  distribution  for  the  classes  and  the  various  parameters 
Subsequently  we  will  derive  optimal  decision  rules  according  to  various  types  of 
criteria  and  discuss  the  asymptotic  consistency  of  the  decision  rule.  We  will 
give  many  types  of  examples  such  as  (i)  whether  a nonlinear  transformation  of 
the  data  will  yield  a better  representation  of  the  data,  (ii)  when  is  a 
polynomial  fit  of  the  data  (or  a fourler  series  representation  of  the  data) 
with  zero  residual  variance  inferior  to  a relatively  small  order  autoregressive 
representation  etc.,  and  point  out  the  superiority  of  the  method  to  existing 
methods  of  comparison  like  hypothesis  testing  [1,2],  maximum  entropy  spectral 
analysis  [^,5,10],  etc. 

Even  If  all  the  classes  have  the  same  prior  probability,  the  posterior 

probability  of  class  C.  being  the  correct  class  for  the  observation  set  C(N) 

having  N observations  is,  under  appropriate  assumptions 
exp[0.5  h. (E(N))1 

P(C,  |5(N))  = -p ! , i = l,...,r  (I.l4) 

j exp [0.5  h.  (e(N))1 
k=l 


h,(C(N))  = 2N  In  f|  - (N-m.)  In(pj  + (Pf,-p,)/N)  - n,  In  N 

- mj  In  p^j  + m.  - (l/P|:.)  (fj(t))^  + 0(l/N)  (1.5) 

i •*  1,2,..., 


r 
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where  In  f.'  is  the  empiricai  mean  of  the  function  ln[d  f.(y)/dy]|y  “ y(t)  , p^j 
is  the  empirical  variance  of  fj(y(t))  and  p.  is  the  residual  variance.  The 
expression  (1.5)  clearly  brings  out  the  adverse  effects  of  dealing  with  classes 
with  large  nj,  i.e.,  having  too  many  parameters  to  be  estimated.  The  role 
played  by  the  lag  variable  m.  and  the  number  of  estimated  parameters  nj  is  quite 
different  in  general  even  though  in  the  widely  discussed  case  of  autoregressive 
processes  m.  = n..  Further,  even  if  p.  = 0,  as  in  polynomial  models, 

P(C.|c(N))  is  still  finite.  These  and  other  features  will  be  discussed 
extensively  in  the  subsequent  sections. 


II.  ASSUMPTIONS 

For  simplicity  we  will  denote  a class  C.  by  the  k tuple  [f j , gj,  m. , n.]. 

The  suppressed  term(^.  will  be  mentioned  wherever  necessary.  Further  f . (t)  A 
fj(y(t)).  Similarly  the  function  g.  will  be  written  as  g.(t,0)  suppressing  the 
other  arguments.  Moreover  g.  can  be  written  as: 

g.(t,0)  = (z.(t-l)/0,  where  z.  is  independent  of  , 
zT(t-l)  = V g.(t,0),  n.  - vector 

I 0 1^  I 

We  consider  r classes  Cj,  i = l,...,r,Cj  = [f.,  g.,  m.,  nj]  where  f.  and  g. 
obey  the  conditions  (A1)  - (A3).  We  will  make  the  following  additional  assumptions 
on  the  functions  f,  g etc. 

(A^)  The  functions  f.  and  fj  and  g.  and  g^  obey  at  least  one  of  the  following 
conditions  for  every  possible  pair  (i,j),  i / j,  i,j  * l,...,r 

(i)  fj(y)  ^ kf j (y)  + C for  any  k,C  and  almost  all  y 

(ii)  For  every  0 e(0|,  there  does  not  exist  a 0 e satisfying  the 

following  relation 

g.  (t,y(t-1),.  ..,y(t-m.),0)  = gj(t,y(t-1) y(t-m^.),0‘) 
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(A5)  If  a process  y(*)  belongs  to  one  of  the  classes  C.,  I = 1,2 then 

N T- 

[1/N  ^ z. (t-1 )zl (t-1 ) ] is  finite  and  positive  definite  for  all  N and 

t=m.  + r^'  '' 

i = 1 r . 

We  need  the  following  assumptions  regarding  the  prior  probability  distri- 
bution of  6 and  p and  the  probability  distribution  of  the  initial  conditions  for 
eq . (1.1). 

(A6)  Let  x(t)  = fj(y(t)).  The  m.  initial  values  y (1 ) , . . . ,y (m. ) of  a process 
y(‘)  obeying  any  model  in  C.  obeys  the  following  normal  density 
p(x(l) .... ,x(mj))  - N[0,Rj] 

where  0 is  the  null,  vector  of  dimension  m.  and  R.  is  a m.xm.  covariance  matrix. 
(A7)  If  y(’)  obeys  a model  in  C.  involving  m.  lagged  variables,  then  the  prior 
density  of  0 and  p obeys  the  following  relation: 

pC0.ply(O y(mj).Cj)  = p(,0,p|c.) 

(a8)  The  variable  p has  the  following  probability  dens  i ty:  p (p  | C.)  = a/p,  a > O^valid 
for  al i i . 

(A9)  The  variable^  p has  the  following  conditional  prior  density  given  p: 

p(e|p,c.)  - n(0q,,Sq.p) 


where  S„.  is  a 
-Oi 


(AlO)The  prior  probability  of  class  C.  is  P(C.),  i = l,2,...,r,  0 < P(C.)  < 1, 
P(C.)  = 1. 

We  will  discuss  the  assumptions.  The  assumption  (A4)  and  condition  (1.3) 
are  sufficient  for  the  classes  Cj,...,C^  to  be  mutually  exclusive,  i.e.,  there 
does  not  exist  a process  y(*)  which  obeys  2 different  models  in  2 different 
classes.  (A5)  is  needed  for  the  existence  of  the  matrices  occurring  in  the 
optimal  decision  function.  To  understand  (A6)  and  (A7),  we  should  note  that  we 
have  not  made  any  assumption  about  the  stationarity  of  the  process  y obeying  any 
model  in  C.  for  any  i.  Specifically,  if  y(‘)  belongs  to  C.,  then  (y  (1 ) , . . . ,y  (N) } , 


the  observation  set  is  generated  as  follows:  y(1) y(ni|)  act  as  initial 

conditions.  Based  on  y(l) y(mj)  and  w(m.+l),  y(mj+l)  is  generated  from 

(l.l),  and  y(t),  t > m.+l  are  generated  recursively  using  (l.l).  Clearly  the 
Initial  conditions  y (I ) , . . . ,y (mj ) cannot  throw  any  light  on  the  parameters  6 and 
P which  characterize  that  particular  model  in  Cj  obeyed  by  y(*)>  This  statement 
Is  the  assumption  (A7). 

The  assumption  (A3)  yields  the  conditional  probability  density  p(y(mj+l),.. 
y(N) |y(l),...,y(m.)e,p,C.)  as  shown  below.  This  expression  In  conjunction  with 
(A6)  and  (A7)  yields  the  joint  probability  density  of  all  the  observations 
namely  p(y (1 ) , . . . ,y (N) |0,p,C) . The  details  are  indicated  below. 

p(x. (mj  + l ) X. (N) |x, (1 ) , . . . ,x. (mj ) ,0,p,C j) 

N 

= n p(x, (t) |x, (t-1 ) , . . . ,x. (I ) ,0,p,C) , by  (A3) 
t=m . + 1 ' 


N 2 

= n exp[-l/2p(x.(t)  - g.(t,0))^],  by  (A3)  (2.1) 

t=m.+l  /2irp 

Transforming  the  variables  x into  y by  the  relation  x. (t)  = fj(y(t)),  (2.1) 
yields : 

P(y (m.+l) .... ,y(N) |y(l ), ... ,y(m.) ,0,p,C.) 

= n (f.(t)//2TTp)exp[-l/2p(f.(y(t))  - g.(t,0))^],  (2.2) 

t=m.+l  ' 

where  f.(t)  = (d  fj(y)/dy)|y  = y(t) 

p(y (1) , • . . ,y (m. ) |0,p,c. ) = p(y(l) , . • • ,y(mj) |Cj) , by  (A7) , 
m. 

n f| (t) 

= ^^72 ^exp[-(l/2)||(f,(l) f,(m,)||^  by  (A6) 

(2it)  * (det  Rj)  *^i  (2.3) 

Multiplying  the  expressions  (2.2)  and  (2.3)  yields  the  required  density 

p(y(l) y(  N)  |0,p,Cj). 


The  prior  density  of  0 given  p In  (A9)  Is  so  chosen  that  the  posterior 

density  p(6|p,^,C.)  Is  also  Gaussian.  There  Is  an  extensive  literature  on 

the  choice  of  the  prior  density  of  p.  The  density  p(p)  given  In  (A8)  Is  commonly 

used  In  statistical  literature  and  can  be  defended  on  many  different  grounds  [8]. 

00 

However  It  Is  called  Improper  since  / p(p)dp  does  not  exist  for  a^  ■ 0.  But  In 

®l 

computing  the  posterior  density,  we  can  allow  the  limit  a^  to  be  zero. 

We  will  offer  some  suggestions  for  the  choice  of^QI  and^Qj,  occurring 

In  the  prior  density  p(Gi|p,C.).  The  only  guideline  available  for  the  choice 

of  Sqj  Is  that  It  be  relatively  large  In  view  of  the  great  Initial  uncertainty 

regarding  the  value  of  0 appropriate  for  the  given  data.  Even  though  the  effect 

of  0QI  and  are  asymptotically  Insignificant  on  the  optimal  decision,  still 

the  arbitrary  choice  of  0qj  and  S^.  for  various  I may  make  the  computation  of 

the  p(C. |e)  more  cumbersome  than  it  need  be.  Accordingly  we  propose  that  the 
' # 

following  choice  for  S_.,  I = l,...,r. 

ui 

(All)  Sq.  = [l/N-m.  I Zj (t-l)zj(t-l)]'’ 

t=m.+l 


The  choice  In  (All)  Is  unconventional  In  the  Bayesian  literature  since  it 
depends  on  the  observations.  However,  we  will  show  later  that  such  a choice 
implies  the  following  expression  for  the  posterior  variance  of  0 given  5 and  P 


Var[0|E,p,Cj]  A Sj  - SQj/(N-m.+l) 


(2.4) 


our  choice  is  reasonable  since  §q|P  is  much  larger  than  the  posterior  variance 
SjP.  Any  other  choice  for  the  S^.  would  have  made  the  expression  for  Sj  more 
complicated  than  the  one  in  (2.4). 

Next  let  us  turn  our  attention  to  the  choice  of  the  vector  0jq  occurring 

In  p(0|p,C.).  The  obvious  choice  for  §10  **  the  null  vector jStated  In  (AI2). 

(AI2)  0JQ  ■ [0,...,0]^,  (nj-vector) 

It  is  Important  to  realize  that  (AI2)  Is  valid  only  If  f,(’)  Is  chosen 

N 

as  stated  In  (Al)  i.e.,  I/N  [ f,(y(t))  « 0.  Otherwise,  (A12)  will  be  Inconsistent 

t-l 

with  the  fact  that  mean  of  w(*)  Is  zero  In  eq.  (I.l). 


Finally  consider  the  covariance  matrix  R.  occurring  In  (A6)  i.e.,  the 
probability  density  of  the  initial  conditions  y (1 ) , . . . ,y(m. ) . The  effect  of 
this  term  on  the  final  decision  rule  is  not  strong.  Hence  to  simplify  the 
computation,  we  make  the  assumption  (A13)- 
(A13)  R.  = Pf.l 

where  p^.  = empirical  variance  of  f.(y(t)). 

III.  THE  POSTERIOR  PROBABILITY  P(C.|c) 

Let  C = {y(D y(N)} 

P(C.  |C)  = p(c|C.)  P(Cj)/p(c) 

= /”  dp  / dle|  p(c|e,p,c.)p(e,p|c.)p(c,)/p(E)  (3.1) 

The  expression  for  p(c|0,p,Cj)  has  been  derived  In  section  II.  p(6,p|C|)  Is 
available  from  assumptions  (A8)  and  (A9) . Hence  the  integration  in  (3.1)  can 
be  performed  as  indicated  in  the  appendix  1 leading  to  the  following  theorem  1 
Theorem  1 ; Under  the  assumptions  (A1)-(A13) »the  posterior  probability  P(C.|e) 
has  the  following  form 

P[C.|C]  = K exp  [0.5  h.(E)] 

where 

r 

K = 1/  J exp[0.5  h.  (C)] 
i=l  ' 

h.(?)  = 2N  IrTTy  - (N-m.)  In  (p  j + (p^:j-Pj)/N) 

+ 2 in  P(Cj)  - nj  In  N - mj  In  p^j  + 6j(mj)  + 0(1/N) 

"i 

G,(m,)  - m,  -(t/p,,)  I (f,(t))^ 

N T ? 

Pj  = l/(N-m.)  I (f j (t)-zj(t-l)e*)^ 
t=m.+l 
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N , N 

0^*  = [(N-m,  + 1)/(N-m.)H  I z,  (t-1  )zT  (t-1 ) ]' ' I Zj(t-1)f,(t) 

t=m.+1  t“m,+1 

I I 

N 

In  fT  = (1/N)  ^^1n  f.(t),  fj(t)  = d ^ |y.y(t) 

N , 

Pj;.  = (1/N)  ]►’  (f.(y(t))  A empirical  variance  of  {f  j (y  (t) ) , . . . ,f  j (y  (N) ) } 
t*1 

A proof  of  Theorem  1 is  in  appendix  1. 

Comment  An  expression  for  the  P(C-)C(N))  without  the  assumptions  (A11)- 
(A13)  regarding  the  parameters  of  prior  distribution  is  given  in  lemma  1 in 
appendix  1.  Obviously  it  is  computationally  more  complicated. 

Comment  2:  E[G^(m.)|c.]  = 0 

Variance  [Gj(m.)|C.]  « 2m. 

Hence  Gj(m.)  is  of  the  order  0(/m.).  While  comparing  2 classes  having 
different  values  of  m. , the  term  6j(m.)  does  not  make  much  difference  in  compari 
son  with  other  terms  and  hence  can  be  neglected  when  N is  large. 

A /c 

Comment  3:  The  model  (f.,  g.,  6.,  p,)  is  the  best  fitting  model  in  the  class 

— I I \/>  I I 

C|  for  the  given  data  C-  Alternatively,  if  C obeys  some  (unknown)  model  (fj, 
g.,  0,  p)  in  C.,  then  0*  and  p*  are  the  Bayesian  estimates  of  0 and  p. 

We  will  discuss  many  other  features  of  the  posterior  probability  function 
P(C. |c)  in  section  V. 

IV.  OPTIMAL  DECISION  RULE  AND  THEIR  CONSISTENCY 
1 . Optimal  Decision  Rules: 

Let  d(c(N))  be  a decision  rule  where  d is  a map: 

d:  {C  ...,C  } 

I n 


Consider  a loss  function  L which  reflects  the  cost  of  wrong  assignment 
of  the  observation  set  C to  a class  Cj  using  the  rule. 

L(C.,d(5)  = Cj)  = 0 if  C,  = Cj 

= w.j  > 0,  if  C|  / Cj 

Our  intention  is  to  choose  the  decision  rule  to  minimize  the  average  value  of 
the  loss  function  L,  i.e.,  minimize  J(d) 

J(d)  = EtL(C.,d(c))] 

= I P(C.)  / L(C..d(5))P(c|C.)d|E| 
i=l  ' ' ' 


j(d(E)  * C ) = /(  I w P(C.|c))p(?)  d|c| 

J i=rl  'J  ' 

The  optimum  decision  rule  is 


d*(0  = C,,  = 


. = Argument  Minimum  ^ w..P(C.|e) 
CjC{C, C^}  i = l ^ 


The  loss  function  or, in  particular , the  weights  w.j  can  be  chosen  to  reflect 
the  particular  needs  of  the  problem.  If  we  are  interested  in  minimizing  the 
probability  of  error  in  the  assignment  of  class  to  C,  the  choice  of  Wjj  is 
wjj  » 0 if  i - j 

= 1 if  i j 


In  that  case,  the  optimum  decision  rule  is; 

d*(c)  = C.  * Argument [Maximum  P(C?|C)] 

CjE{C, 


(4.2) 


i.e.  The  decision  rule  assigns  C to  the  class  having  the  highest  posterior 
probabi 1 i ty. 

Probability  of  error  of  the  optimal  decision  rule  (4.2)  = [1-Max  P(C.|€)] 

(4.3) 

On  the  other  hand,  if  our  sole  interest  in  class  selection  is  to  obtain  a 
reliable  estimate  of  spectral  density,  then  we  should  choose  ^ as  follows 
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“n 

Wj.  » / d(o(ln  S.(to)  - In  S.  (oi)) 

J -0.^ 

where  S|((ii)  is  the  "best"  estimate  of  the  spectral  density  of  the  process  based 
on  C given  that  the  process  C belongs  to  class  Cj.  In  that  case,  the  optimum 
decision  rule  (4.1)  simplifies 

d*(C)  = Cj  if  (In  P(Cjlc)-H)^  < (InP(CjU)-H)^  V i j , (4.4) 

wi  th 

H = I P(C. |C)  lnP(C. |5)  (4.5) 

i = 1 ' 

The  2 illustrations  should  be  sufficient  to  reveal  the  power  of  the  decision 
rule  to  reflect  the  needs  of  the  particular  problem. 

2.  Consistency  of  the  Decision  Rule; 

We  will  show  that  the  optimum  decision  rule  in  (4.2)  Is  asymptotically  con- 
sistent. The  consistency  of  other  optimum  decision  rules  such  as  (4.4)  can  be 
established  in  a similar  manner.  Without  any  loss  of  generality,  let  us  consider 
the  comparison  of  only  2 classes  Cj  and  If  the  process  y(*)  comes  from  some 

(unknown)  model  in  class  C^,  then  we  will  show  that  P(C2 1 C (N) )/P (C j 1 5 (N) ) tends 
to  + ~ as  N tends  to  infinity  showing  that  the  decision  rule  correctly  classi- 

fies the  observation  set.  A precise  expression  for  the  P (C2 | C (N) )/P (C^ | C (N) ) Is 
given  in  the  following  theorem  2. 

We  should  emphasize  that  the  asymptotic  behavior  of  P (C2 1 5 (N) )/P (C j | C (N) ) when 
C2  is  the  correct  class  may  be  quite  different  from  the  asymptotic  behavior  of 
P(Cj 1c(N))/P(C2|c(N))  when  is  the  correct  class. 

Theorem  2:  Consider  a pair  of  mutually  exclusive  classes  Cj  and  C2,  C| 

{f j ,gj ,mj ,nj ,i0j } under  the  assumptions  (AI)-(A13)«  Assume  that  the  given 
process  y(*)  obeys  a model  (f2f92i®2’^2^  ^ ^2  -2  ^2 


P2  0- 


1 


Case  (I)  Let  f 


^2* 


n,  > 


(4.6) 

(4.7) 


For  every  B e{^,  there  exists  a 0 e R so  that  (4.8)  is  satisfied. 
9)  (t,0)  “ 92(t,e')  (A. 8) 


Then 


LIm  [P(C,U(N))/P(C,IC(N))]'^'"^  - exp[(n,-n2)/2] 


N->« 


(4.9) 


Case  (II)  fj  » f^,  but  n^,  and  gj  and  do  not  obey  either  (4.7)  or  (4.8). 


Then 


LIm  [P(C-|C(N))/P(C,U(N))]’^^  - k,  > 1 


(4.10) 


N-h» 


Case  (III)  ^2*  redefine  the  variable  y so  that  fj(y)  “ y + kj , 

relabel  f2(y)  as  f(y).  Assume  f(*)  obeys  the  following  assumption  (BI). 

(Bl)  If  f(y)  is  normal,  then 

E[(y-y)^]  1 E[(f(y)-f(y))^]/exp[2E  in  f(y)] 
where  y and  f(y)  stand  for  mathematical  expectations  of  y and  f(y)  and  f is 
the  derivative  of  f(y). 

Then  the  limit  given  in  (4.10)  is  also  valid.  Q.E.D. 

The  theorem  is  proved  in  appendix  2. 

The  assumption  (Bl)  appears  to  be  obeyed  by  most  differentiable  functions 
such  as  f(y)  r In  y + k^,  f(y)  = y 4^  etc.  Still  we  have  stated  it  as 
assumption  since  we  do  not  have  a proof  of  it. 

Case  (i)  is  valid  for  a pair  of  classes  C^  and  C2  where  C2  is  obtained  from 
Cj  after  setting  certain  components  of  the  parameter  vector  8 in  Cj  to  zero,  and 
the  true  process  obeys  C2*  A common  illustration  is  C2  in  a class  of  AR  models 
of  order  n2  and  Cj  is  a class  of  AR  models  of  order  n^  nj  > 02,  with  the  correct 
model  belonging  to  C2.  In  such  a case,  theorem  2 roughly  states 

P(C2U(N))/P(C,  |C(N))  - exp(^^^y-2-  • In  N)  - 
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or  posterior  error  probability  A P(Cjlc(N))  = N 


/n,-n,)/2 


in  ai)  other  cases,  covered  by  Cases  (ii)  and  (iii) 
P(C,1C(N))/P(C, |C(N))  - k2,  > i 


or  posterior  error  probability  = P(C,|c(N))  - k, 


(4.12) 


Eq.  (4.11)  clearly  illustrates  the  empirically  known  fact  that  it  is  easier  to 
distinguish  2 classes  with  entirely  different  structures,  coming  under  Cases 
(ii)  or  (iii)  than  to  discriminate  between  2 classes  with  similar  structures 
where  the  structure  of  the  difference  equation  in  C2,  the  correct  class  can 
be  obtained  from  that  of  Cj  by  setting  a few  parameters  In  It  to  zero,  I.e.,  it 
is  difficult  to  distinguish  the  correct  class  C2  from  the  higher  order  class 
Cp  Note  that  if  Cj  and  C2  have  similar  structures,  but  n^  < n2»  then  we  have 
an  example  of  Case  (i)  and  the  error  probability  decays  exponentially. 

V.  DISCUSSION  AND  COMPARISON 

1 . Main  Character  1st ics  of  the  Decision  Rule; 

We  will  highlight  some  of  the  important  features  of  our  decision  rules. 

Most  of  these  features  are  absent  in  the  decision  rules  based  on  hypothesis 
testing  and  other  methods  which  will  be  discussed  subsequently. 

(PI)  The  decision  rule  can  compare  simultaneously  many  classes  obeying  the 
conditions  in  Sections  I and  II. 

(P2)  The  decision  rule  is  transitive,  I.e.,  Let  C.^  Cj  denote  that  in  com- 
paring the  classes  C.  and  C j , the  decision  rule  prefers  Cj  to  C j . Then  Cj  V ^2 
and  C2 ^ Cj  ”5»Cj  V provided  all  the  classes  are  equiprobable. 

(P3)  The  decision  rule  is  asymptotically  consistent,  I.e.,  while  comparing 

2 classes  C,  and  C>  based  on  the  observation  set  E(N),  then  Urn  P(C.|C(N))/ 

' ^ N-w»  ^ 

P(C,U(N))  -►  <»  If  the  observation  set  comes  from  some  (unknown) member  In  C,  and 
’ -(n,-n2)/2  ^ 

the  error  probability  P(C.|c(N))  decays  at  least  as  fast  as  N If 

« * 

nj  > n2  or  k2  where  k2  > 1 • 
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(P4)  The  decision  rule  is  optimal  in  the  sense  that  it  minimizes  a 
suitable  ioss  function  which  reflects  the  needs  of  the  particular  problem. 

(P5)  An  explicit  expression  is  available  for  the  probability  of  error 
in  the  decision  given  by  the  rule. 

(P6)  The  only  arbitrary  quantities  appearing  in  the  decision  rule  are 
the  prior  probabilities  of  the  classes.  There  are  no  other  arbitrary  quantities 
like  significance  levels.  The  effect  of  the  assumption  about  the  prior  probabili- 
ties is  asymptotically  negligible  unlike  the  significance  levels  used  in  the 
hypothesis  testing  methods.  Typically  the  prior  probabilities  of  the  classes 


can  be  assumed  to  be  equal. 

(P7)  The  roles  played  by  m.  and  Oj  in  the  decision  rules  are  quite 
different.  The  contribution  of  terms  involving  m.  to  In  P(Cj(c)  Is  0(1)  where 
as  the  contribution  of  terms  involving  nj  is  0(ln  N) . 

(P8)  The  posterior  probability  P(Cj5)  involves  a term  exp[-nj  In  N]  even 
if  all  the  r classes  are  a priori,  equiprobable.  When  we  are  comparing  2 classes 
Cj,  i = 1,2  such  that  they  yield  the  same  value  of  p.  and  In  f 1 , i = 1,2, 

the  posterior  probability  of  the  class  having  smaller  n.  will  tend  to  1 as  N tends 
to  Infinity.  This  Is  a quantitative  proof  of  the  "principle  of  parsimony"  which 
states  that  if  2 models  explain  the  same  data  (i.e.,  have  same  p),  the  model 
involving  smaller  number  of  estimated  parameters  has  a higher  plausibility  of 
being  the  correct  model  of  the  process  than  the  other  model. 

(P9)  The  decision  rule  for  comparing  classes  is  valid  even  if  one  of  the 
members  of  a class  yields  a zero  residual  variance  (i.e.,  p.  ■ O)  with  the  given 
data.  This  is  possible  because  p always  appears  in  the  decision  rule  in  the 
form  ln[p  + (p^-p)/N],  Hence  even  if  p is  zero  or  very  small,  ln[p  + (p^-p)/N] 
is  still  finite.  This  property  is  very  useful  In  comparing  stochastic  models 
like  autoregressive  models  Involving  a small  number  of  parameters  with  polynomial 
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or  fourier  series  models  which  involve  a large  number  of  parameters,  but  yield 
low  residual  variance. 

2.  Comparison  with  the  Hypothesis  Testing  Approach: 

This  approach  does  not  possess  most  of  the  properties  (P2)“(P9).  The 
hypothesis  testing  approach  is  usually  designed  to  test  whether  certain  com- 
ponents of  vector  0 in  the  difference  equation  of  form  (l.l)  are  not  signifi- 
cantly different  from  zero  (hypothesis  Hq  or  null  hypothesis)  or  the  contrary 
(hypothesis  Hj).  Consequently  the  method  can  be  used  to  compare  only  2 classes 
at  a time  and  these  classes  form  a small  subset  of  the  classes  mentioned  in 
(PI).  A typical  application  is  when  eq.  (l.l)  is  a n^  order  autoregressive 
(AR)  equation  and  the  null  hypothesis  is  that  the  given  process  y(*)  obeys 
a n2  order  AR  process,  '’2  ^1’  '"®‘  hypothesis  is  that  the  coefficients 

of  y (t-n^+l ) , . . . ,y (t-nj ) in  (1.1)  are  all  zero.  Even  here,  we  will  show  in 
example  4 that  the  decision  rule  is  not  always  transitive  (property  P2)  and  the 
decision  rule  is  not  always  asymptotically  consistent  (property  P2) . Hypothesis 
testing  is  routinely  used  in  problems  where  the  danger  of  rejecting  when 
is  true  (type  I error)  is  very  much  greater  than  accepting  Hq  when  is  not 
true  (type  II  error).  The  decision  rule  is  designed  to  place  an  upper! imit  on 
the  probability  of  type  I error,  but  no  measure  of  the  type  II  error  is  available 
(property  P5).  Thus  the  decision  rule  is  inappropriate  for  problems  when  both 
types  of  errors  are  important.  The  decision  rule  involves  an  arbitrary  parameter 
like  significance  level  (i.e.,  probability  of  type  I error  allowed),  (property 
P6).  In  general,  the  decision  rule  is  not  optimal  in  the  sense  that  any 
specific  criterion  function  is  minimized  (property  P4).  Further  the  hypothesis 
testing  cannot  handle  a relatively  simple  comparison  problem  such  as  whether 
normal  distribution  or  log  normal  distribution  is  appropriate  for  representing 
the  given  data  or  whether  an  autoregressive  process  or  a polynomial  fit  (or 
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orthogonal  function)  is  appropriate  for  representing  the  given  data  (property 
P9) • Examples  1-i»  bring  out  the  limitations  of  the  hypothesis  testing  approach. 

3.  Other  Ad-Hoc  Procedures: 

There  are  many  adhoc  procedures  for  comparing  several  classes  of  model.  The 

criterion  MAIAC  [3]*  is  used  for  comparing  classes  of  autoregressive  models  of 

different  orders.  For  each  class,  we  compute  aj(?)  = -N  In  Pj  - 2n^,  where 

p.  = residual  variance  of  the  best  fitting  model  in  class  C. 

n.  = number  of  parameters  to  be  estimated. 

Chosen  class  is  C.  = Argument  max[a.(E)l 
' C.  ' 

I 

Here  there  is  no  distinction  between  m.  and  n..  The  term  -2n.  is  inserted  into 

I I I 

the  decision  function  using  certain  ideas  of  information  theory,  but  there  seems 
to  be  no  particular  reason  for  the  factor  2.  The  decision  rule  is  transitive. 

However,  the  decision  rule  has  no  optimality  property.  We  have  no  idea  of  the 
probability  of  error  of  the  decision  rule.  Most  of  the  examples  1-4  are  outside 
the  scope  of  this  rule.  In  particular,  the  rule  cannot  be  used  to  compare  classes 
in  which  one  of  the  models  such  as  a model  with  n orthogonal  polynomials  in  t « 

gives  zero  residual  variance  for  the  data  since  aj(5)  = «. 

One  can  show  that  the  use  of  decision  rule  is  equivalent  to  the  use  of 
hypothesis  testing  procedure  at  an  appropriate  significance  level  [I]. 

Another  adhoc  approach  is  the  maximum  likelihood  approach  [2]  in  which 
we  maximize  the  likelihood  function  in  each  class  over  the  allowable  set  of 
parameters  and  choose  that  class  which  has  the  largest  maximum  likelihood  value 
among  them.  As  before  the  rule  is  not  always  asymptotically  consistent 

and  we  have  no  idea  of  the  probability  of  error  given  the  rule.  Further  it  is 
invalid  for  comparing  2 classes  of  AR  models  with  different  orders,  since  the 
maximum  value  of  likelihood  function  associated  with  the  class  with  larger 
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order  Is  usually  greater  than  that  of  the  smaller  order  AR  class.  Similarly 
It  cannot  handle  pairs  of  classes  mentioned  In  (P9).  The  examples  (1-1*) 
bring  out  the  relative  merits  of  the  various  methods. 

I*.  The  Maximum  Entropy  Spectral  Analysis  [^,5tl0l- 

The  original  aim  of  the  MESA  approach  Is  to  obtain  a reliable  estimate  of 
the  spectral  density  of  a process  y from  Its  N observations.  Instead  of  assuming 
an  arbitrary  structure  for  the  process,  they  found  a structure  for  the  process 
y(*)  to  maximize  the  entropy  function  under  the  following  2 restrictions 

(I)  y(‘)  Is  stationary  and  zero  mean 

(II)  All  the  correlations  of  the  process  up  to  mth  order  are  known,  I.e., 

(frj,  I = 0,...,m  are  known  where  ((>.  = E[y  (t)y(t-l)] . 

The  result  Is  an  autoregressive  structure  for  the  process  namely 
m 

y(t)  = a.  y(t-j)  + w(t) 
t=l 

where  w Is  a zero  mean  sequence  N(0,p)  and  aj,...,a^  and  p are  determined  from 
the  known  autocorrelations  ((>.,  I = 0,...,m.  The  required  spectral  density  Is 
the  S(o)) 

S(u)  = p/||l  - aje"'“...  - a^  i'n“||2 

In  practice,  (|)j  are  not  known  and  hence  we  replace  them  by  the  corresponding 
emperical  correlation  coefficients  computed  from  the  N observations  y (I ),..., 
y(N). 

The  key  choice  In  the  method  Is  the  Integer  m.  The  method  does  not  suggest 
a method  for  the  choice.  The  maximum  value  of  m Is  N.  Typically  m can  be  O.IN 
or  0.2N.  The  value  of  Integer  m Is  Increased  till  the  estimated  spectral  density 
shows  sufficient  resolution  In  the  required  frequency  range. 

On  the  other  hand.  If  we  use  the  approach  of  Section  III  for  the  selection 
of  m,  we  have  all  the  advantages  of  MESA  method  and  the  additional  Information 
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such  as  the  posterior  probability  of  Cj  being  correct  class,  an  estimate  of 
the  variance  of  spectral  density  estimate  etc.  which  are  not  available  in  MESA. 

In  particular  we  can  choose  the  decision  rule  to  minimize  errors  in  the 
estimation  of  spectral  density  as  indicated  in  Section  IV. 

We  will  give  ^ examples  to  Illustrate  the  relative  behavior  of  the  various 
methods  of  comparison.  In  all  the  examples,  the  test  observation  set  Is  C 
(y(l ) » • • • >y (N)) . If  C.  is  the  class,  the  corresponding  residual  variance  is 
p.  and  the  corresponding  signal  variance  is  p^j.  Cj  ^ Cj  means  that  the 
decision  rule  prefers  the  class  C.  to  Cj  when  the  associated  conditions  are 
satisfied  and  the  reverse  if  the  conditions  are  not  satisfied.  The  function  f . (y) 
is  chosen  as  mentioned  in  (Al). 

Examp  I e I : This  example  Is  used  to  Illustrate  the  empirically  observed  fact  that 

models  which  Involve  the  estimation  of  a large  number  of  parameters  ate  usually 
inferior  to  appropriately  chosen  models  involving  a small  number  of  parameters, 

I even  though  the  larger  parameter  model  may  result  in  zero  residual  variance. 

' Specifically  the  2 classes  are 

9i>  ""i  = 0.  = N} 

I “ {f , , g2»  mj  = I , n^  = 2} 

9i(t,e)  - e + i 0^^,*,^(t-l) 

k=l 

g2(t,y  (t-1 ) ,0)  = 6|  + 02y(t-l) 

^ where  !,<(>,(t)  i^-(t)...  are  a set  of  orthogonal  functions  orthogonal  over  t - 

i ' ^ 

[1,N].  They  can  be  polynomials  or  sinusoids.  C2  is  the  class  of  first  order  AR 
models.  Since  our  data  set  C has  N observations,  Pj,  the  residual  variance  of  the 
best  fitting  model  in  class  is  zero.  Let  p^^  = p^2  “ ^y  “ empirical  variance 
of  y (• ) . 


\ 


We  can  assume  >>  Py/N.  Decision  rule  (i».2)  yields 
C2  y Cj  ^if  -(N-2)  In  - 2 In  N - 2 In  py  > -N  ln(py/N)  - N In  N 
Retaining  terms  of  order  0(1)  and  higher,  we  get 
^2  )'  ^ In  N)/(N-2)] 

For  instance,  if  N = 100 

C2  y C,  if  p^  < .91  Py 

i.e.,  as  long  as  the  first  order  AR  model  explains  about  9 percent  of  the 
signal  variance,  the  AR  model  class  is  superior  to  the  class  of  models  with  N 
polynomials,  inspite  of  the  fact  that  residual  variance  p^  = 0.  The  superiority 
of  first  or  second  order  AR  models  to  polynomial  or  orthogonal  function  models 
in  modeling  many  (but  not  all)  empirical  series  is  well  known  to  the  workers 
dealing  with  empirical  model  building  beginning  with  the  work  on  sunspot  series 
[9].  However  the  superiority  of  the  AR  model  was  deirxinstrated  by  an  elaborate 
procedure  like  comparing  correl iograms  and  other  validation  methods  [1].  In 
contrast  the  present  theory  offers  a relatively  simple  quantitative  explanation 
of  this  phonemenon. 

Note  that  the  conclusion  is  the  same  if  we  had  compared  Cj  with  any  other 
class  Cj  of  models  involving  2 parameters,  say,  the  class  of  models  involving 
only  2 orthogonal  functions,  i.e.,  is  the  class  of  all  straight  line  fits 
to  the  data  in  the  plane  [t,y(t)]  ^nd  Cj  is  the  class  of  polynomial  fits. 

As  before,  we  should  prefer  the  straight  line  fit  if  it  explains  at  least  9 
percent  of  the  signal  variance  in  preference  to  the  polynomial  fit  which  yields 
zero  residual  variance. 

Example  2;  We  compare  2 classes  and  having  same  nj,  but  differ  in  f.  and 
possibly  in  g. . 
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Ci,  - ^f^,  9^,  m.  n} 

Cg  = {fy  gj,  m,  n} 

^^(y)  = y + k^,  f^(y)  = In  y + 

Let  = empirical  variance  of  y (I ) , . . . ,y (N) 

= empirical  variance  of  1ny(l) Iny(N) 

Iny  = empirical  mean  of  Iny (1 Iny(N) 

By  decision  rule  (4.2) 


l __  M ^ (Pf  c) 

C5  r S Jf  ln(P;,/P5)  > 2 Iny(^)  + 


N-m 


Note  that  the  decision  rule  asymptotically  does  not  depend  on  N explicitly. 
(It  depends  on  N via  p.  etc.). 

As  a particular  illustration  of  this  family  of  problems,  let  us  determine 
whether  a normal  distribution  or  a log  normal  distribution  is  appropriate  for 
representing  the  given  observation  set.  Thus 
C4  * ify  gi,,  m = 0,  n = 1} 

C5  = {fy  g^j,  m = 0,  n = 1 } 
fi,(y)  * y + kj,  f^(y)  = Iny  + 

gi,(t.e)  = 0,,  g5(t,0)  = 0,, 

Here  p^^  — Pfi^t  Pj  ~ ^f5* 

The  decision  rule  (4.2)  simplifies  into 

if  ln(p^^/p^5)  > (2  ■n7)/0  + 1/N) 

The  simplicity  of  the  decision  rule  should  be  compared  with  the  corresponding 
complexity  in  using  hypothesis  testing  methods.  Using  the  hypothesis  testing 
approach,  we  cannot  directly  compare  the  normal  and  log  distribution  fits  to  the 
given  data.  All  we  can  do  is  compare  whether  the  normal  distribution  (y,  p^^^)  is 
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a good  fit  to  the  empirical  distribution  of  the  dataset  at  the  95  percent 
significance  level.  Alternatively  we  can  test  whether  the  log  normal  distribu- 
tion (In  y,  is  a good  fit  to  the  emperical  distribution  of  y (I ) , . . . ,y (N) . 

It  is  not  difficult  to  construct  examples  in  which  we  can  find  both  the  normal 
and  log  normal  fits  are  significant  at  the  95  percent  significance  level. 

Example  3:  (Autoregressive  processes)  We  will  compare  2 classes  of  autoregres- 

sive models  of  orders  n^  and  n^  respectively. 

Cj  = {f,  9j,  nj,  nj},  i = 6,7,  > n^ 

f(y)  “ y,  P|:  * empirical  variance  of  y 
Vgg.(t,0)  * (y(t-l),...,y(t~n.)) 

Let  N » n^,  N » n^  and  pj  >>  p^/N,  i = 6,7 
Decision  rule  (4.2)  yields; 


h > '6 

-N  In  Py  - n^ 


i.e.,  CyY  Cg,  if 


ln(p^/p^)  - 

ln(pg/p^) ^ 


n^  In  N > -N  In  Pg  - ng  ln(p^/p^)  - ng  In  N 

(n--n,)[ln  N + ln(p,/p,)] 

-i-i ii— — LJ—  (5.1) 

N • fij 


If  pg  and  pj  are  not  too  far  from  each  other 
ln(pg/p^)  = (pg-p^)Zp^ 

i.e.,  we  should  increase  the  order  of  AR  model  from  ng  to  n^  only  if  the 
fractional  decrease  in  variance  is  greater  than  the  quantity  on  the  RHS  of 
(5.1).  If  ny  =■  ng+l,  N = 200,  (pf/p^)  “ 2 then  RHS  of  (5.1)  “ .03155 
I.e.,  we  should  add  another  AR  term  only  if  the  fractional  decrease  In 
variance  is  at  least  3.15  percent. 

Suppose  we  want  to  use  the  hypothesis  testing  approach.  Then  the  decision 
rule  has  the  following  form: 


^7  r 6 N 


(5.2) 


2k 


where  k is  a threshold  depending  on  ng-n^.N  and  the  significance  level.  It  is 
determined  by  the  fact  that  if  Is  true  then 

N(Cg-C7)/p7(n7~ng)  - F(n^-ng,N) 

For  N > 100,  the  threshold  depends  only  on  (n7-n^) . 

Thus  the  principle  difference  between  rules  (5>l)  and  (5.2)  is  the  absence 
of  the  factor  In  N and  the  presence  of  the  nonlinear  threshold  k in  (5.2). 
Example  k:  Let  C.  denote  class  of  AR  models  of  order  I in  which  nj  ■ I.  Our 

intention  is  to  show  that  the  decision  rule  given  by  hypothesis  testing  (i.e. 
rule  (5.2)  is  intransitive  (i.e.  it  does  not  obey  P2).  Let  P|  denote  the 
residual  variance  of  class  Cj.  Let  N • 100.  Suppose  we  choose  95  percent 
significance  level. 


k(l,  100)  = 3.81*  . kj 

k(l*,  100)  *=  2.38  » k^^ 


We  will  presently  show  that  the  nonlinear  dependence  of  k(nj,n2)  on  n^  is  the 
cause  of  the  intransitivity. 


If  we  compare  C.  and  C.^j,  (5.2)  yields 

N(pj-Pj^j)  ^ kj  choose  Cj 
p > kj  =>  choose 


Suppose  we  compare  C.  and 

N(pj-pj^^)  ^ k^  choose  C. 

choose 


(5.3) 


(5.k) 


Suppose  the  numerical  values  of  P|,  i = 1,2, 
(5.5)  and  (5.6)  which  are  not  inconsistent 


l*k, 


'I 


+ I <-<  (1 


k 

) 


p./p 


i+1 


I. 


I. 


.4 


,5  obey  the  following  relations 


(5.5) 

(5.6) 
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The  decfsion  rule  (5.^)  and  the  left  hand  side  of  inequality  (5.5)  Imply 

(5.7) 

(5.7) 

Decision  rule  (5.3)  and  eq.  (5.6)  Imply  (5.8) 

Cj  ^ ^j+j  » * “ l*...i^  (5.8) 

Repeated  use  of  (5.8)  implies  (5.9) 

c,  Cj  (5.,) 

Eq.  (5.9)  and  (5.7)  mutually  contradict  each  other  showing  the  Intransitivity 
of  the  decision  rule  (5-2). 

Note,  however,  that  the  decision  rule  (5.1)  is  not  intransitive  since  it 
does  not  involve  any  arbitrary  threshold. 

VI.  CONCLUSION 

We  have  developed  a method  of  comparing  different  classes  of  dynamical 
models  using  Bayesian  theory.  The  method  can  handle  a wide  variety  of 
classes  and  is  much  superior  to  the  traditional  methods  of  comparison  like 
hypotheses  testing.  The  method  clearly  shows  the  limitations  of  models  such 
as  polynomial  fits  which  using  a large  number  of  parameters  can  render  the 
residual  variance  zero.  It  clearly  shows  that  such  models  have  no  explanatory 


power. 
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Append  t x I 

We  will  establish  theorem  I In  3 steps.  We  will  first  state  lemma  I which 
gives  an  expression  for  P(Cj|s)  using  only  the  assumptions  (Al)-(AIO).  We  shall 
state  lemma  2.  Using  lemma  I and  lemma  2,  and  the  additional  assumptions  (All)- 
(AI3)«  we  will  prove  theorem  I.  Nexti  we  will  es tab  I ish  lemmas  3»  l«  2 successively, 
lemma  3 being  required  in  the  proof  of  lemma  I. 

Lemma  I ; 

Under  the  assumptions  (AI)-(AIO),  the  posterior  probability  of  P(Cj|c)  has 
the  following  expression 

P(Cj|c)  “ k exp[0.5h.(e)  + 0(l/N)],  i « l,...,r  (I) 

where 


h.  (C)  * 2nX.  - (N-mj)lnej  - nj  In  N 

+ ln(det  Sj/det  S^j)  - In  det  Rj  + 2ln  p(Cj)  + bj, 
b.  = m.  - (f . (I) , . ,f j (mj))Rj ’ (f . (I) , . . . ,f I (mj))^ 

X.  = (l/N)  U In  f,(y)/dy|y.y(t) 

gj  = + (l/(N-m.)(0)''’s'!e 

N T 

e - 0*  - 0QJ  = Sj  I z,(t-l)(f,(t)  - zj(t-l)0Q,) 
t“m  j+l 

-1  N T -1 

S|  * [SqJ  + I Zj(t-I)  zj(t-l)] 

t=mj+l 

Sj  - S,(N-m.) 

^ T 9 

p,  - (l/(N-m.))  I (f,(t)  - zT(t-l)0*)^ 

t-mj+l  ' ' ' 

Zj(t-I)  » Vgg(t,y(t-I) y(t-mj),0) 


(2) 

(3) 

(M 

(5) 

(6) 

(7) 

(8) 

(9) 

(10) 


28 


Lemma  2; 

Under  the  assumptions  (All)  and  (A12)  6j  In  (5)  will  simplify  as  follows 

^i  “ Pi  + (pf,-P,)/(N-mj)  (II) 

where 

PfI  " ’/N  t^fjCt))^  (12) 

Proof  of  Theorem  I : 

Let  us  first  use  the  expression  for  S^j  In  (Ail)  which  simplifies  Sj  in 
(7)  as  follows 

^1  " Soi/(N-m,+l)  (13) 

we  can  use  (13)  to  simplify  the  expression  ln(det  Sj/det  S^j)  occurring  In  (2). 

Using  (8)  and  (13),  we  get  i 

(N-m.) 

In  d.t  S,  - In  d.t  S„|  - In  ,,^1  - In  d,t  S;,, 

■ n In (N-mj/N-mj+1 ) 

= -n/(N-mj)  - 0(1/N)  (14) 

Next,  let  us  use  assumption  (A13).  Then  bj  In  (3)  simplifies  Into  (15) 

"•i  2 

*>1  - (fj(t))  /p^j  (15)  , 

i 

clearly  E[bj |Cj]  « 0.  ’ 

Variance  (bj |Cj]  - 2mj 

Moreover,  (AI3)  implies  in  det  Rj  - m^  In  p^j  (16)  j 

Substituting  (11),  (14)-(16)  in  (2),  yields  theorem  1.  Q.E.D.  i 

I ■ 

We  need  the  following  lemma  3 to  prove  lemma  1.  To  simplify  the  notation,  i 


Lemma  3: 


- (0-0*)V’(e-e*)  + (N-m)6 

Proof  of  Lemma  3;  Let  0 = e*-e 


N 

I (f(t)  - z^(t-I)0)^  + (e-0Q)V*(0-0.) 

t=ttri-l  0 O'  0' 


0: 


I (f(t)  - z^(t-l)0*  + z^(t- l)(0*-0))2 


t=m+l 


+ (0-0*  + 0)^S"'(0-0*  + 0) 

Expanding  the  squares  and  rearranging  terms 

N 


LHS  of  (17)  = [(0*-0)  ( I z(t-i)z^(t-l)  + S"’)(0*-0)J 

t-m+1  0 


+ I (f(t)  - z^(t-I)0*)^  + (0  )S’*0] 

t-m+1  0 


T ** 

+ [2(0*-0)  { I z(t-i)(f(t)  - z^(t-i)0*)-s;;’0}j,  (18) 
t“m+l  0 

Coefficient  of  (0*-0)^  in  (18) 

r N 

- I z(t-l)f(t)  - I z(t-l)z(t-I)0*  - Sl’o 

t-nrf1  t-m+1  0 

N ^ 

' z^(t-l)0Q)  - [ I z(t-1)z^(t-I)  + S’’]0 


“ 0,  by  definition  of  0 in  (6)  and  (7) 


3i 


Substitute  (19)  in  (18)  and  rewrite  R.H.S.  of  (18)  using  definition  of  a in  (5) 
and  (9).  We  get 

LHS  of  (17)  - RHS  of  (17). 

Proof  of  lemma  1; 

Recall  Cj  = (y  (1 ) , . • . ,y  (nij ) },  - {y(mj  + l) y(N)}. 


Let  k = 


We  will  first  compute  p(52U,.P.C) 

p(e2lc,,P,c)  = / dep(52Ur®.P.C)p(0|p.C,5,) 

“ / d0p(52U,,e,p.C)p(elp,C),  by  (A7) 


k / d|e| 


(2irp) 


exp[-l/2p)  I (f(t)  - z^(t-l)0)^] 
t»»m+l 


using  (2.1|)  and  (A9). 


( I (f(t)  - z (t-l)0)  + rearranging  terms 

t=m+l  w u u 


k / d|0( 


exp[-(l/2p)(6-e*)^ 


s"’(0-0*)]  exp[-(l/2p)(N-m)a] 


(27Tp) 


using  lemma  3. 

I ^2 


Now  we  will  integrate  over  p after  multiplying  by  p(p|C)  given  by  (A8) 
p(C2le,.C)  = / dp  p(p|C)p(C2U,,P,C) 

1 /7 

= / dp  “ k ^) r_/i/o..\/ii \oi 


“ 2ka  Ldx  exp[-x^(M-m)B/2)]x^”'"~' (Det  S/Det  S 
, 2 0 
using  the  transformation  p ■ 1/x  , 

- ak{(N-m)6/2]'^'^""’^''^  T [ (N-m)/2]  [Det  S/Det  S^]  ’ (2ir) 

Simplify  the  above  expression  using  the  expression  for  k in  (20)  and  the  standard 

formula  for  r(x)  namely: 

inr(x)  ■ X In  X - X + 1/2  In(2Tr/x), 

N 

2 in  p(C-|CiiC)  ■ 2 in  o + 2 J In  f (t)  - (N-m)  In  B * (N-m) 

t-m+1 

+ In  [Air/ (N-m)]  + ln(det  S/det  S^)  - (N-m)ln2ir,  (23) 

S is  of  the  order  0(1/N).  Hence  we  can  define  S ■ S(N-m),  so  that  S is  0(1). 

det  S - (det  S) (I/N-m)"  (2A) 

By  assumption  (A6) 

m - 

2 In  p(C,|C)  » - m In  2ir  - In  det  R + 2 [in  f ' (t)  - ||(f(l) f(m)||o-l 

' t-I 

(25) 

2 in  p(e|C)  - 2 in  p(C2U,,C)  + 2 In  p(C,|C)  (26) 

N 

-2  I In  f'(t)  - (N-m)  In  6 - (n+1)  In  (N-m) 
t=I 

+ In (det  S/det  Sq)  - In  det  R+m-  ||f(l),...,f  (m)||^-l 

- M In  2ir  - N + 2 Ina  + In(Air),  using  (23)  and  (25), 

(27) 

In  p(C  |c)  » In  p(5|C  ) + In  p(C  ) - In  p(c) 

■ 0.5h  (c)  + terms  independent  of  any  class  + 0(1/N) 

(28) 

where 

N 

h (5)  » 2 J In  f ' (t)  - (N-m)  ing  - n In  N + ln(det  S/det  S.) 

t»I  ° 

- In  det  R +2  In  p(C  ) + (m  - ||f  (1) f (m  )||J-1 
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Proof  of  lemma  2: 


Recall  =(l/N-m)  I (f(t)  - 

t“m+l  ® 


Recall  that 


P =(l/N-m)  I (f(t)  - 2^(t-l)0*)^ 
t=m+l 


(1/N  "Z  (<:">)0Q  - z^(t-l)0)^,  by  definition  of  0 in  (6) 

I z(t-l)z^t-l)0 


t-m+1  — p -v>- 

t“n)+i 

■ 26  I z(t-l)(f(t)  - z^(t-l)0.)/N-m 

t=m+l  0 

Substitute  for  the  coefficient  of  0 in  the  last  term  using  (6).  replace  the  first 
term  by  and  rearrange  the  terms. 

P = Pf  + e (l/N-m  I z(t-l)z^(t-l)  - 2S"'/N-m)0 
t=m+l 

* Pf  + (e)'^S'’0(l  - 2(N-mfl)/N-m) 
or  0^S"’0  = (p^  - p)/(i  + 2/N-m) 

= (Pf  - p)(l  - 2/N-m) 

By  definition  of  $ in  (5), 

6 - p +{ l/N-m) (0)^S"'0 
" p +( l/N-m) (p^  - p)  + 0(1/N) 
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Appendix  2 

We  will  outline  the  proof  of  theorem  2: 

Let  (fj,  g.,  6*(N),  P*(N))  e C.,  1 = 1,2,  be  the  best  fitting  models  for 
the  given  data  in  the  2 classes  and  P*,  etc.,  being  defined  in 

Section  III.  We  will  assume  that  6.(N)  and  p.(N)  tends  to  6.  and  p^  as  N tends 
to  infinity  with  probability  one.  In  view  of  the  assumptions  (Al)-(A7),  the 
models  in  each  class  are  identifiable  given  the  class.  Hence  g2»  02*  ^2^ 

the  asymptotically  recovered  model  is  the  exact  representation  of  y(‘)  stated  in 
the  theorem.  Note  that  by  definition,  y(‘)  does  not  obey  the  model  (f j , gj , 9j , 
Pj).  Rather,  this  model  is  the  best  fitting  model  in  Cj  for  the  semi  infinite 
data  set  {y (1 ) ,y (2) , . . . }. 

Recall  that 

1nIp(C2lc(N)/p{C, 1C(N))]  - 0.5lh2(C(N))  - h,(C(N))]  (1) 

Cases  (i)  and  (ii) 

Without  any  loss  of  generality,  we  can  set  fj(y)  = f2(y)  “ y + 

yj(tjt-l)  * one  step  ahead  predictor  of  y(t)  based  on  y(t-j),  j ^ 1 suggested 
by  the  model  (fj,  g^,  §j,  pj) 

= gj (t,9)  - kj  (2) 

E[y(t) ly(t-j),  j i I]  = g2(t,§2)  " kj 

By  definition  of  expectation  and  normality  of  y 

E[y(t)  - 9,(t|t-l))2]  . 9^(tlt-l))2] 

1 .e. , Pj  ^ P2 

At  this  point,  we  will  discuss  the  cases  (i)  and  (ii)  separately. 

Case  (i):  The  structure  of  g^  and  g2  mentioned  in  this  case  imply  (3) 


L'm  >n[P(C2|5(N))/P(C,U(N))] 


Urn  0.5 


(h2(5(N))  - (C(N)) 

inN 


= Urn  0.5  {-N  In  p,  + N In  p,  + (n,-n,)ln  N + 0(l)} 

N-«o  InN  ^ ^ 

= Urn  0.5  {-N  In  p,  + N In  p,  + (n,-n,)ln  N + 0(1)},  since  p,  -►  p, 
N InN  ^ I I ^ II 

$2  -►  P2  w.P.  1 . 

“ (nj-n2)/2,  by  (3) 

or  (P(C2|C(N))/P(C,|?(N))’/’"'^  -v  exp(^^). 

In  Case  (i i ) : 

P,  > P2 

or  p^  = p2  exp  [kj],  kj  >0 

Urn  1 1n[P(C2U(N))/P(C,  U(N))]  = Lim  (In  p,  - In  P2)  = k > 0, 

N-»« 

or  (P(C2U(N))/P(cJc(N))’^^  -*■  exp[k^]  A k2  > 1 
Case  (ill): 

Without  any  loss  of  generality  we  can  set  f j (y)  = y + kj 
Define  y^(t|t-l)  as  in  (2). 

By  definition 

P,  A E[(y(t)  - y,  (t|t-1))^|y(t-j),j  ^1] 

> E[{y(t)  - E(y(t))|y(t-j).j  > 1 ) }^ |y (t-j ) . j > 1]  (4) 

Let  ?2(tjt-l)  = E[f (y(t)) |y(t-j) ,j  ^ 1] 

Since  the  conditional  distribution  of  f(y(t))  given  y(t-j),j  ^ i is  normal, 
we  can  use  condition  (B1)  and  rewrite  it  as  follows: 


E[(y(t)  - E(y(t)  |y(t-j),j  ^ D)  |y(t-j),j  ^ 1] 

iE[{f(y(t))  - f2(‘|t-l))^|y(t-j),j  > l]/exp[2E(ln  f (y(t)  |y(t-j) , 
Jl’)]  (5) 

Using  (4) 

In  p,  > in  E[{y(t)  - E(y(t) |y(t-j).j  ^ 1) }^|y(t-j) ,j  ^ 1 
^ In  2E(in  f (y(t)) |y (t-j) , j ^ I)] 
using  (5)  and  definition  of 

> In  p^  2E(in  f (y(t))  (6) 

Urn  i 1n[P(C2|E(N))/P(C,U(N))] 

= Urn  [0.5  [h,(E(N))  - h,(E(N))]/N] 

N->«  ^ ' 

N 

= Litn  [2(1/N)  I In  f'(y(t))  - in  p,  + In  p,  + 0(1og  N/N)] 

t=l  2 -^1 

> kj  > 0.  by  (6) 

•r 

((P(C2|C(N))/P{C Je(N)  ~ exp[k_],  for  large  N 


